Consider the following model:
Likelihood:
\[ y_i \sim Normal(\mu, \sigma = 1) \]
Prior:
\[ \begin{aligned} \mu &\sim Normal(0, 1) \\ \sigma &\sim Exponential(1) \end{aligned} \]
For very simple models, like this one, we can write and evaluate the posterior directly. We can then use these calculated values to plot the full posterior.
Let's write this (unnormalized) posterior in R:
set.seed(2)
# Data:
y = rnorm(100, mean = 2, sd = 1)
# Posterior, in log scale for numerical stability
logP_theta_y = function(mu, sigma = 1, data = y){
  ll = sum(dnorm(data, mu, sigma, log = TRUE))
  logprior = dnorm(mu, 0, 1, log = TRUE) + dexp(sigma, 1, log = TRUE)
  logprior + ll # log P(theta) + log P(y|theta)
}
In this univariate case it’s easy to check many possible values of \(\mu\) to find the high probability values, but in high dimensions this is not feasible.
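To make this concrete, here is a minimal sketch of the grid evaluation (the grid range 0 to 4 is an arbitrary choice that happens to bracket the true mean of 2). It uses the logP_theta_y() function defined above, exponentiating after subtracting the maximum for numerical stability:

```r
# Evaluate the (unnormalized) log-posterior on a grid of mu values
mu_grid = seq(0, 4, length.out = 200)
logpost = sapply(mu_grid, logP_theta_y)
# Exponentiate (stably) and normalize so the grid values sum to 1
post = exp(logpost - max(logpost))
post = post / sum(post)
plot(mu_grid, post, type = "l", xlab = "mu", ylab = "posterior (normalized)")
```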
The rethinking syntax closely mirrors the mathematical notation we have been using in class. For example, a simple linear regression of \(y_i\) on \(x_i\) using some standard-ish priors can be defined as:
\[ \begin{aligned} y_i &\sim Normal(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta x_i \\ \alpha &\sim Normal(0, 1) \\ \beta &\sim Normal(0, 1) \\ \sigma &\sim Exponential(1) \\ \end{aligned} \] First, let's do the exercise of checking the assumed distributions for the three parameters.
A normal distribution with \(\mu = 0\) and \(\sigma = 1\):
An exponential distribution with \(\lambda = 1\)
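These two prior densities can be visualized side by side with R's built-in density functions:

```r
par(mfrow = c(1, 2))
# Normal(0, 1) prior used for the intercept and slope
curve(dnorm(x, mean = 0, sd = 1), from = -4, to = 4,
      xlab = "value", ylab = "density", main = "Normal(0, 1)")
# Exponential(1) prior used for the residual standard deviation
curve(dexp(x, rate = 1), from = 0, to = 5,
      xlab = "value", ylab = "density", main = "Exponential(1)")
par(mfrow = c(1, 1))
```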
First we simulate some data:
N = 100 # sample size
# Linear regression parameters
alpha = 1 # Intercept
beta = 1 # slope on x
## Simulate a normal response that has mu as a linear function of a predictor
x = rnorm(N) # Simulated independent variable. Mean = 0 and sd = 1
mu = alpha + beta * x # linear model for the expected value
y = rnorm(N, mu) # Response variable, with residual sd = 1
And now we fit our model, which translates to the following code:
library(rethinking)
## Rethinking code
m1 = ulam(alist(
  y ~ normal(mu, sigma),   # Likelihood
  mu <- a + b * x,         # Linear model
  a ~ normal(0, 1),        # Prior on intercept
  b ~ normal(0, 1),        # Prior on slope
  sigma ~ exponential(1)), # Prior on residual standard deviation
  data = list(y = y, x = x), # Data list
  iter = 2000, chains = 4, cores = 4, refresh = 0, cmdstan = TRUE) # Sampler control parameters
The precis() function generates a simple summary table of fitted coefficients:
precis(m1, digits = 2) # Summary of model fit
Exercise: Fit the same model using the lm() function in R, and compare the summary of the two fitted models.
If the sample size is large, the likelihood overwhelms the prior, so even strong priors lead to similar inferences.
Exercise: Change the priors. Can you find a prior that substantially changes the estimates? Next, reduce the sample size (change the N variable). Do the priors make a difference now?
Some models are too complicated to fit in the simple rethinking syntax. For these models we have to write Stan code directly. We can start to see how these models are structured by looking at the Stan code rethinking generates.
writeLines(m1@model)
Note the different code blocks for declaring the different parts of the model (data, parameters, and model). Also note the need to declare all variables and their domain constraints. We will see more complicated Stan models later in the course.
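As a point of reference, a minimal Stan program for this regression might look like the following. This is a hand-written sketch, not the exact code rethinking generates, but it shows the same block structure and domain constraints:

```stan
data {
  int<lower=1> N;       // number of observations
  vector[N] x;          // predictor
  vector[N] y;          // response
}
parameters {
  real a;               // intercept
  real b;               // slope
  real<lower=0> sigma;  // residual sd, constrained positive
}
model {
  a ~ normal(0, 1);
  b ~ normal(0, 1);
  sigma ~ exponential(1);
  y ~ normal(a + b * x, sigma);
}
```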
Running multiple independent MCMC chains allows us to check if the chains converged to the same parameter values. Ideally, all chains should jump around the same values, without getting stuck on some value and without much similarity between subsequent values. Let’s plot the 4 chains:
traceplot_ulam(m1)
These look fine. Notice the initial jumps far away from the typical set. This is the sampler’s initial warm-up phase, before the chains converge.
There are also diagnostics calculated across multiple chains to check for convergence. The most common is R-hat, which compares the between-chain and within-chain variances. If the chains converged to the same distribution, R-hat should be very close to 1. The precis() summary reports R-hat for all parameters by default.
There are other more advanced diagnostics for Stan models, which can be very powerful and allow us to identify ill-behaved models. See this post for more.
When we are using simulated data, we can compare the posterior estimates to the known parameters. The bayesplot package provides several functions for visualizing the posterior:
library(bayesplot)
samples = as.data.frame(extract.samples(m1))
mcmc_recover_hist(samples[,c("a", "b")] , true = c(alpha, beta))
Sometimes the mean of the posterior does not match the simulated value exactly. This is expected for any single run, as long as the simulated value lies in a reasonably high-probability region of the posterior. A more thorough check would be to run the simulation several times and check whether the model is properly calibrated. For example, the true value should fall inside the 50% posterior interval in about 50% of the runs.
Challenge exercise: Make a for loop to check model calibration.
Now we can evaluate the model fit to the simulated data. First, let’s draw a few regression lines from the posterior. This will give us some idea about the uncertainty in the regression line.
plot(y ~ x, pch = 19)
for(i in 1:30)
  abline(a = samples[i, "a"], b = samples[i, "b"], col = adjustcolor(2, alpha = 0.3))
We can also include the mean posterior estimate of the regression line:
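A minimal sketch of this, repeating the plot above and adding the posterior-mean line from the column means of the samples:

```r
plot(y ~ x, pch = 19)
for(i in 1:30)
  abline(a = samples[i, "a"], b = samples[i, "b"], col = adjustcolor(2, alpha = 0.3))
# Mean posterior estimates of intercept and slope
abline(a = mean(samples$a), b = mean(samples$b), lwd = 2)
```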
While looking at the fitted model against the data is usually more informative, we can also plot the coefficient estimates directly. It is sometimes difficult to interpret coefficients directly, but with some practice numerical estimates can be useful in inference.
A basic confidence interval plot can be generated by calling the plot function on the precis() output table:
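For example:

```r
# Point estimates and compatibility intervals for each parameter
plot(precis(m1, digits = 2))
```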
One powerful concept in Bayesian models is that of simulating data from the posterior. Ideally, we would like the posterior distribution to generate simulated data that is compatible with the data used to fit the model. If we cannot reproduce some relevant aspect of the data given the fitted model, it is usually a sign that the model needs to be modified.
Here, we generate 30 simulated data sets from the posterior:
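A sketch of this with rethinking's sim() function, which draws simulated responses from the posterior predictive distribution, and bayesplot's ppc_dens_overlay(), which overlays their densities on the observed data:

```r
# 30 simulated data sets (one row per posterior draw)
y_sim = sim(m1, n = 30)
# Compare the density of each simulated data set to the observed data
ppc_dens_overlay(y, y_sim)
```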
One useful technique is to compare some tail percentile from the posterior predictive distribution to the observed percentile. This can identify problems at the extreme tails of the data distribution, where it’s harder to get a good fit.
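For instance, bayesplot's ppc_stat() can compare an upper-tail percentile (the 90th here, an arbitrary choice) between the observed data and the posterior predictive simulations:

```r
# Posterior predictive distribution of the 90th percentile vs. the observed value
q90 = function(x) quantile(x, probs = 0.9)
ppc_stat(y, sim(m1, n = 1000), stat = "q90")
```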
Let’s use simulations to reproduce the example we saw in class. We used prior simulations to understand the consequences of the priors we were using.
Let’s do the simulations here, and then compare the prior to the posterior after fitting the data:
library(ggplot2)
library(patchwork)
data(Howell1) # get the data
d <- Howell1
d2 <- d[d$age >= 18, ] # filter adults
xbar <- mean(d2$weight) # calculate the mean weight
d2$weight <- d2$weight - xbar # center weight at its mean
x_range = range(d2$weight) # calculate the range of weights
# Simulating from the prior
set.seed(2971)
N <- 100 # 100 prior samples
a <- rnorm(N, mean(d2$height), 20) # Samples from the prior distribution of the intercept
b <- rlnorm(N, 0, 1)               # Samples from the prior distribution of the slope
data = data.frame(a, b)
# Visualizing the prior
p.prior = ggplot() +
  ylim(-100, 400) +
  geom_abline(data = data, aes(slope = b, intercept = a), alpha = 0.2) +
  geom_hline(yintercept = 0) +
  geom_hline(yintercept = 272) +
  scale_x_continuous(labels = c(-10, 0, 10) + 45, breaks = c(-10, 0, 10), limits = x_range) +
  theme_minimal() + labs(x = "weight (kg)", y = "height (cm)") +
  geom_vline(xintercept = 0, linewidth = 1) + # mean weight (weights are centered)
  annotate("text", x = -13, y = 274, label = "Tallest person ever (272cm)",
           hjust = 0, vjust = 0) + ggtitle("Prior")
p.prior
# Model:
fit = ulam(alist(
  y ~ normal(mu, sigma),
  mu <- a + b * x,
  a ~ normal(0, 20),
  b ~ lognormal(0, 1),
  sigma ~ exponential(1)),
  data = list(y = d2$height,
              x = d2$weight),
  iter = 2000, chains = 4, cores = 4, cmdstan = TRUE)
# extracting the samples
samples = as.data.frame(extract.samples(fit))
e_b = mean(samples$b) # Mean posterior estimate of b
e_a = mean(samples$a) # Mean posterior estimate of a
# Plotting the posterior
p.posterior = ggplot() +
  ylim(-100, 400) +
  geom_point(data = d2, aes(x = weight, y = height), alpha = 0.1, shape = 19) +
  geom_abline(data = samples[1:100, ], aes(slope = b, intercept = a), alpha = 0.1) +
  geom_abline(slope = e_b, intercept = e_a, alpha = 1, col = 2) +
  geom_hline(yintercept = 0) +
  geom_hline(yintercept = 272) +
  scale_x_continuous(labels = c(-10, 0, 10) + 45, breaks = c(-10, 0, 10), limits = x_range) +
  theme_minimal() + labs(x = "weight (kg)", y = "height (cm)") +
  geom_vline(xintercept = 0, linewidth = 1) + # mean weight (weights are centered)
  annotate("text", x = -13, y = 274, label = "Tallest person ever (272cm)",
           hjust = 0, vjust = 0) + ggtitle("Posterior")
# comparing the prior and the posterior
p.prior + p.posterior
If you want to run this tutorial on your own computer (and not the server), you will need to install the rethinking package.
First we need to install rstan:
install.packages("rstan", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
Next, the rethinking package:
install.packages(c("coda","mvtnorm","remotes", "loo", "bayesplot"))
library(remotes)
remotes::install_github("rmcelreath/rethinking")
The cmdstanr package is also a great way to get a more stable R environment:
# we recommend running this in a fresh R session or restarting your current session
install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
Next, install cmdstan:
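cmdstanr provides a helper that downloads and builds CmdStan (this step can take several minutes and requires a working C++ toolchain):

```r
library(cmdstanr)
check_cmdstan_toolchain() # verify that a C++ toolchain is available
install_cmdstan()         # download and compile CmdStan
```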
Now we should be good to go!